PYTHON-5536 Avoid clearing the connection pool when the server connection rate limiter triggers #2509
base: backpressure
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…b#2507) Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ction rate limiter triggers
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Steven Silvester <[email protected]>
pymongo/asynchronous/pool.py
Outdated
    conn.conn.get_conn.read(1)
except Exception as _:
    # TODO: verify the exception
    close_conn = False
2 comments:
- I believe this logic needs to move to connection checkout. Here in connection check-in we already know the connection is usable, because we're checking it back in after a successful command.
- Instead of a 1ms read, can we reuse the existing _perished() + conn_closed() methods?
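To illustrate the first point, here is a minimal sketch of a liveness check at checkout rather than check-in. `PooledConn`, `Pool`, `idle_too_long()`, and `peer_closed()` are hypothetical stand-ins written for this example; they are not PyMongo's classes and do not reproduce `_perished()` or `conn_closed()`.

```python
import select
import socket
import time
from collections import deque


class PooledConn:
    """Hypothetical pooled connection wrapper (not PyMongo's class)."""

    def __init__(self, sock, max_idle_seconds=60.0):
        self.sock = sock
        self.max_idle_seconds = max_idle_seconds
        self.last_checkin = time.monotonic()

    def idle_too_long(self):
        return time.monotonic() - self.last_checkin > self.max_idle_seconds

    def peer_closed(self):
        # An idle, healthy socket has nothing to read. If it is readable and
        # a peek returns b"", the peer has closed the connection.
        try:
            readable, _, _ = select.select([self.sock], [], [], 0)
            if not readable:
                return False
            return self.sock.recv(1, socket.MSG_PEEK) == b""
        except (OSError, ValueError):
            return True


class Pool:
    """Hypothetical pool that validates connections only at checkout."""

    def __init__(self):
        self._idle = deque()

    def checkin(self, conn):
        # The command just succeeded, so no liveness probe is needed here.
        conn.last_checkin = time.monotonic()
        self._idle.append(conn)

    def checkout(self, connect):
        # Discard stale or closed connections before handing one out.
        while self._idle:
            conn = self._idle.pop()
            if conn.idle_too_long() or conn.peer_closed():
                conn.sock.close()
                continue
            return conn
        return PooledConn(connect())
```

With this shape, a connection closed by the server while idle is dropped on its own at the next checkout, without any need to clear the rest of the pool.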
Done
Nice work!
(cherry picked from commit 0d4c84e)
pymongo/asynchronous/pool.py
Outdated
if not self.is_sdam and type(e) == AutoReconnect:
    self._backoff += 1
    e._add_error_label("SystemOverloaded")
    e._add_error_label("Retryable")
We need to move this logic so that it also covers the TCP+TLS handshake, which happens above.
I set a breakpoint in the TCP+TLS handshake error handler and confirmed that handshakes are succeeding. The error only occurs on hello/auth.
Okay, I'm actually surprised by this, since the design (SPM-4319) indicates the rate limiter rejection happens before the TLS handshake.
Ideally we'd like to detect |
else:
    if self._closing_exception:
        raise self._closing_exception
    if self._closed.done():
Would calling is_closing here be better? In theory it would catch more edge cases.
Hmm, let me try that.
No, it is ambiguous as to whether connection_lost has been called yet. Since connection_lost is synchronous, checking for self._closed.done() assures that we have actually lost the connection.
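A small standalone asyncio sketch of that distinction, for illustration only. `TrackingProtocol` and `demo_close_ordering` are hypothetical names, and the `_closed` future here merely mimics the attribute discussed above; this is not the PR's protocol class. It relies on the stock asyncio socket transports, where `close()` marks the transport as closing immediately and `connection_lost` runs on a later loop iteration.

```python
import asyncio


class TrackingProtocol(asyncio.Protocol):
    """Hypothetical protocol that resolves a future once connection_lost runs."""

    def __init__(self):
        self._closed = asyncio.get_running_loop().create_future()

    def connection_lost(self, exc):
        if not self._closed.done():
            self._closed.set_result(None)


async def demo_close_ordering():
    loop = asyncio.get_running_loop()
    # Throwaway local server so the client has something to connect to.
    server = await loop.create_server(asyncio.Protocol, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    transport, proto = await loop.create_connection(
        TrackingProtocol, "127.0.0.1", port
    )

    transport.close()
    # is_closing() flips as soon as close() is requested...
    closing_now = transport.is_closing()
    # ...but connection_lost has not been called yet at this point.
    lost_now = proto._closed.done()

    await asyncio.wait_for(proto._closed, timeout=5)
    lost_later = proto._closed.done()

    server.close()
    await server.wait_closed()
    return closing_now, lost_now, lost_later
```

Running `asyncio.run(demo_close_ordering())` shows the window in which `is_closing()` is already true while the future set by `connection_lost` is still pending.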
pymongo/asynchronous/pool.py
Outdated
):
    self._backoff += 1
    error._add_error_label("SystemOverloaded")
    error._add_error_label("Retryable")
Could you merge backpressure? I originally added the incorrect labels here. They should be "SystemOverloadedError" and "RetryableError".
Done
pymongo/asynchronous/pool.py
Outdated
self._backoff += 1
error._add_error_label("SystemOverloaded")
error._add_error_label("Retryable")
print(f"Setting backoff in {phase}:", self._backoff)  # noqa: T201
Instead of inspecting the error message after the fact, is it possible we can record some state to determine if the error happened during DNS+TCP or after? Like:
# Assume all non-DNS/TCP/timeout errors mean the server rejected the connection due to overload.
if not errorDuringDnsTcp and not timeoutError:
    error._add_error_label("SystemOverloadedError")
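A slightly fuller sketch of that idea, assuming each connection phase can be observed separately (which may not hold for every I/O backend). `ConnectionFailure`, `TRANSPORT_PHASES`, and `run_phases` are hypothetical names invented for this example; the `labels` set stands in for PyMongo's error labels rather than using its API.

```python
class ConnectionFailure(Exception):
    """Hypothetical error that records the failing phase and its labels."""

    def __init__(self, phase, cause):
        super().__init__(f"{phase} failed: {cause!r}")
        self.phase = phase
        self.cause = cause
        self.labels = set()


# Phases whose failures are treated as plain network errors (an assumption).
TRANSPORT_PHASES = {"dns", "tcp"}


def run_phases(phases):
    """Run (name, callable) steps in order, recording which one raised."""
    for name, step in phases:
        try:
            step()
        except Exception as exc:
            err = ConnectionFailure(name, exc)
            # socket.timeout is an alias of TimeoutError on Python 3.10+.
            is_timeout = isinstance(exc, TimeoutError)
            if name not in TRANSPORT_PHASES and not is_timeout:
                # Assume any other failure means the server rejected the
                # connection due to overload.
                err.labels.update({"SystemOverloadedError", "RetryableError"})
            raise err from exc
```

The point of recording the phase up front is that the labeling decision no longer depends on parsing the error message afterwards.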
DNS is already resolved by the time we make a pool as far as I can tell, and we can't distinguish between TCP connection and TLS handshake for async.
I added some logic in 7548f7b that looks for a specific error attached to AutoReconnect.
For discussion: we can still run into the condition where we hit this line, there is no closing error, and we have not yet received any data on the protocol. I verified this by setting a flag when buffer_updated is called. As far as I can tell from the Protocol/Transport docs and the properties available on each, we don't have a way to ascribe more semantic meaning to this condition.
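For reference, a rough standalone reproduction of that condition, assuming a peer that accepts the TCP connection and then closes it cleanly without sending anything. `FlaggingProtocol`, `CloseImmediately`, and `demo_silent_close` are hypothetical names for this sketch and are not taken from the PR.

```python
import asyncio


class FlaggingProtocol(asyncio.BufferedProtocol):
    """Hypothetical client protocol that flags whether any bytes arrived."""

    def __init__(self):
        self._buf = bytearray(1024)
        self.received_data = False
        self.lost = asyncio.get_running_loop().create_future()

    def get_buffer(self, sizehint):
        return self._buf

    def buffer_updated(self, nbytes):
        self.received_data = True

    def connection_lost(self, exc):
        if not self.lost.done():
            self.lost.set_result(exc)


class CloseImmediately(asyncio.Protocol):
    """Server side: accept the connection, then close without sending."""

    def connection_made(self, transport):
        transport.close()


async def demo_silent_close():
    loop = asyncio.get_running_loop()
    server = await loop.create_server(CloseImmediately, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    _, proto = await loop.create_connection(FlaggingProtocol, "127.0.0.1", port)
    # Wait for connection_lost; exc is whatever the transport passed in.
    exc = await asyncio.wait_for(proto.lost, timeout=5)
    server.close()
    await server.wait_closed()
    return proto.received_data, exc
```

In this setup the client sees a closed connection with no data received and no exception attached, which is the case that carries no further semantic information.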
The problem got even worse when using gevent, since the error was something completely different, so I reverted any extra handling of the connection error.
Currently testing with this script for async:
and this one for sync: